1. INTRODUCTION



Most speech coding algorithms operating at rates of 
around 8 kb/s attempt to reproduce the original
speech waveform. Their efficiency of reproducing the 
waveform is obtained by using models which exploit 
knowledge of the generation of the speech signal. In 
contrast, most coders operating at rates of around 2.4 
kb/s are completely parametric, usually transmitting
parameters describing the pitch and the spectral en- 
velope at regular intervals. However, because of 
model inadequacies, the quality of reconstruction of 
current parametric methods never reaches that of the 
original signal, even at high bit rates. 

In this paper, a new method which is positioned 
between the waveform coders and the parametric 
coders is presented. It is based on the assumption 
that, for voiced speech, a perceptually accurate speech 
signal can be reconstructed from a description of the 
waveform of a single, representative pitch cycle per 
interval of 20-30 ms. Figure 1 shows the smooth evo- 
lution of the shape of the pitch cycle, which is typical 
for voiced speech signals. We will show how such a 
signal can be reconstructed by interpolating prototype
pitch cycles between the updates. The prototype- 
waveform interpolation (PWI) method retains the 
natural quality typical of coders which encode the en- 
tire waveform, but requires a bit rate close to that of 
the parametric coders. 

We discuss PWI methods based on linear predic- 
tion (LP). In LP-based speech coders, the signal is 
reconstructed from knowledge of the predictor coeffi- 
cients and a description of the excitation signal. Of 
the existing LP-based algorithms, the code-excited 
linear-prediction (CELP) algorithm [1] and the LP 
vocoder [2] are examples of waveform and parametric 
coders, respectively.



In the simplest form of CELP the speech waveform 
is described by time-varying LP filter coefficients and 
a filter excitation consisting of the concatenation of 
scaled fixed-length vectors from a codebook. To 
achieve high efficiency during voiced speech, most 
implementations include a long-term predictor [3], 
or adaptive codebook [4], to facilitate periodicity of
the reconstructed signal. Despite recent improvements
[5, 6], inaccurate reproduction of the periodicity
remains the main source of perceptual distortion
in the current CELP algorithms at rates below 6 kb/s.

In the LP-based vocoders the voiced speech signal
is modeled by a single pulse per pitch cycle. Because of
excessive periodicity, this often leads to a buzzy
character of the reconstructed speech. Recent work has
shown that the speech quality can be improved
significantly by providing more information about the evolving
waveform shape. Using a cluster of pulses for each
pitch cycle, with blockwise shape adaptation, in
combination with a smoothly varying overall gain
produced good results [7]. Alternatively, good-quality
voiced speech can be obtained at rates of around 3
kb/s by careful placement of the single-pulse
locations [8, 9]. Although significantly improved over the
LP-based vocoders, and similar in quality to 4.8 kb/s
CELP, such single-pulse excited (SPE) speech coders
still suffer from some buzziness.

Both the CELP and the SPE methods attempt to 
reproduce the original waveform by using a (spec- 
trally weighted) signal-to-noise ratio (SNR) of the 
reconstructed speech signal as a criterion to determine
the excitation sequence. However, maintaining
the periodicity of the original speech signal is important
for its perceptual quality, and maximisation of
the SNR often leads to a nonoptimal degree of
periodicity. Thus, it was found in both the CELP [6] and
the SPE coders [9] that improved speech quality can 
be obtained by increasing the periodicity, despite an 
associated reduction in SNR.

FIG. 1. A 14-ms segment of a voiced speech signal, uttered by a
female speaker and band-limited to 4 kHz.



Maintaining a smooth pitch track and the correct 
degree of periodicity is fundamental to time-scale 
modification of speech, making it useful to look at 
these methods from a speech coding viewpoint. Re- 
cently, excellent results in time-scale modification 
were obtained with methods which are pitch synchronous
[10, 11]. The basic idea behind these procedures
is to add pitch cycles for slowing down the speech rate,
and to eliminate pitch cycles for increasing the speech
rate. The success of these methods implies that, in
order to maintain a good speech quality, transmission 
of information during each pitch cycle is not essential, 
but that it is important to maintain a high correlation 
between neighboring pitch-cycle waveforms, and to 
provide a good description of these waveforms. In 
other words, these time-scaling techniques suggest 
that less frequent updates of the waveform, combined 
with interpolation, may result in an efficient
encoding.

Recent developments in speech coding by sinusoi- 
dal reconstruction also point toward the importance 
of maintaining the correlation between the waveform 
of successive pitch cycles. Traditionally, coders using 
sinusoidal reconstruction have been sensitive to rever- 
beration. Recently, it was shown that reverberation 
could be eliminated by maintaining phase coherence 
during successive pitch cycles [12]. In informal
experiments, we have found that the reconstructed
speech signal has a reverberant quality when it is mod- 
ified by an added component which changes rapidly 
over time, but which itself has a harmonic structure 
similar to that of the original speech signal. Thus, a 
reverberant character is caused by an added compo- 
nent which is correlated from one pitch-cycle to the 
next, while a noisy character (such as that of basic 
CELP) is caused by an added component that is not 
correlated between successive pitch cycles. In both 
noisy and reverberant speech signals, it is the dy- 
namics of the pitch-cycle waveform that is disturbed. 

These results suggest that, in voiced speech, it is 
important to maintain the original dynamics of the 
pitch-cycle waveform. In PWI this is accomplished by 
interpolation in combination with other features. The 
PWI procedure can be seen as a generalization of the
original vocoder concept, which transmits not only
the pitch period and the LP filter specification but
also a prototype excitation waveform at each update.
Similar to CELP, an analysis-by-synthesis method is
used to quantize the excitation waveform. However,
in the PWI method a single pitch cycle is quantized
every 20-30 ms, whereas in CELP the speech waveform
is quantized on a frame-by-frame basis. The
excitation waveform and the filter parameters are
interpolated independently between updates.
Alternatively, the method can be interpreted as a simple
vocoder with single pulses exciting a time-dependent
pole-zero filter, where the excitation waveform
provides the coefficients of the all-zero filter.

The PWI method avoids the common problems 
which are caused by incorrect dynamics of the pitch- 
cycle waveform. Noise, buzziness, and reverberation 
can be controlled because the speech signal is
reconstructed on a pitch-cycle by pitch-cycle basis. In cases
where noise is present in the original signal, such as in 
speech with a breathy character, this can be modeled 
by adding white noise to the excitation signal. Rever- 
beration and noise are controlled by maintaining the 
correlations between sequential pitch-cycle wave- 
forms similar to that of the original signal. By main- 
taining the level of periodicity of the original speech 
signal, excellent sounding reconstructed speech can 
be obtained at bit rates of 2.5 to 4 kb/s.

We proceed as follows. In the next section, we
describe the principles of the PWI in more detail. In
Section 3, we provide the practical details for
implementing the method. In Section 4 we show how the
method performs. We conclude in Section 5 with a
summary of the main distinguishing features of the
PWI method.

2. PRINCIPLES OF THE PWI METHOD 



2.1. Separation of Spectral Envelope and
Spectral Fine Structure

The two major features of a speech spectrum are 
the formant structure (the spectral envelope) and the 
fine structure. As a rough model, the formant structure
is determined by the vocal tract shape, while the
fine structure is determined by the vocal cords. We 
consider the formant structure as being independent 
of the fine structure and assume that these two fea- 
tures evolve separately over time. For good results, 
these features must be interpolated separately in the 
PWI method. This can be done by deconvolving the 
speech signal into an excitation signal which is white, 
but maintains the spectral fine structure (pitch), and
a description of the formant structure.

It is convenient to model the formants with standard
LP techniques. The spectral envelope can be encoded
and interpolated using well-known procedures.
For this reason, we have chosen to describe the
prototype pitch cycle with a set of LP filter coefficients on
the one hand, and the pitch and a description of the
excitation waveform over one pitch cycle on the other
hand. The excitation waveform is constructed by
interpolation of the prototype excitation waveforms.
The reconstructed speech signal is obtained by filtering
this excitation waveform with a time-varying
all-pole filter, as in other LP-based techniques.
Subsection 2.2 describes the basic construction of the
excitation waveform by interpolation, while
Subsection 2.3 describes the explicit control of noise and
reverberation.

2.2. Interpolation of the Excitation Waveform 

As a first step in the LP-based PWI reconstruction 
method, the excitation is computed. It is obtained 
from interpolation of the prototype excitation wave- 
forms, each describing one pitch cycle. To obtain good 
quality for the reconstructed speech signal, its pitch 
contour must be sufficiently smooth, and the correla- 
tions between adjacent pitch cycles must be suffi- 
ciently high. In this section we will discuss several 
methods satisfying these conditions. We start out 
with describing two blockwise interpolation proce- 
dures and then we discuss a continuous interpolation 
method. For convenience of notation, we will describe 
the excitation as a continuous, rather than a sampled, 
signal. 

2.2.1. Blockwise interpolation with fixed time
scale. Let p(k) be the pitch period of an arbitrary
pitch cycle k, and let the current interpolation interval
start in the center of pitch cycle k = 0, with pitch
period p(0), and end in the center of pitch cycle k
= K, with pitch period p(K). We interpolate the
pitch period linearly with the pitch-cycle index k:

\[
p(k) = \frac{K-k}{K}\,p(0) + \frac{k}{K}\,p(K), \qquad k = 0, 1, \ldots, K. \tag{1}
\]

The time locations, $t_k$, of the centers of each of the
pitch periods, k, are obtained by simply adding the
pitch periods of prior pitch cycles:

\[
t_k = t_0 + \frac{1}{2}\sum_{j=1}^{k}\bigl(p(j) + p(j-1)\bigr)
    = t_0 + p(0)\,k + \bigl(p(K) - p(0)\bigr)\frac{k^2}{2K},
    \qquad k = 0, 1, \ldots, K. \tag{2}
\]



Note that this implies that, for the entire interpolation
interval,

\[
t_K - t_0 = \frac{K}{2}\bigl(p(0) + p(K)\bigr). \tag{3}
\]



This simple relationship holds if one interpolates
linearly in the pitch-cycle index k; linear interpolation
in time leads to less elegant results, but performs
equally well.
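As a numerical check of Eqs. (1) to (3), the following minimal sketch accumulates the interpolated pitch periods and compares the result with the closed form. The function names and the example values (times in ms) are illustrative assumptions, not part of the original method description.

```python
def pitch_period(k, K, p0, pK):
    """Eq. (1): pitch period interpolated linearly in the pitch-cycle index k."""
    return (K - k) / K * p0 + k / K * pK


def cycle_centers(t0, K, p0, pK):
    """Eq. (2), summation form: centers t_0..t_K of the pitch cycles."""
    centers = [t0]
    for j in range(1, K + 1):
        # Distance between adjacent centers is half of each neighboring period.
        step = 0.5 * (pitch_period(j, K, p0, pK) + pitch_period(j - 1, K, p0, pK))
        centers.append(centers[-1] + step)
    return centers


def cycle_center_closed_form(t0, k, K, p0, pK):
    """Eq. (2), closed form."""
    return t0 + p0 * k + (pK - p0) * k * k / (2 * K)


# Illustrative values only: four pitch cycles, period growing from 5 ms to 7.5 ms.
centers = cycle_centers(0.0, 4, 5.0, 7.5)
interval_length = centers[-1] - centers[0]   # should agree with Eq. (3)
```

The summation and the closed form agree because the sum of (2j - 1) over j = 1..k equals k squared, which is why interpolating in the pitch-cycle index gives the simple interval length of Eq. (3).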

Let us denote the prototype excitation waveforms
associated with pitch cycle 0 and K by $v(0, \tau)$ and
$v(K, \tau)$, respectively, where $v(0, \tau)$ is defined on the
interval $[-\tfrac{1}{2}p(0), \tfrac{1}{2}p(0))$ and $v(K, \tau)$ is defined on
the interval $[-\tfrac{1}{2}p(K), \tfrac{1}{2}p(K))$. In the first interpolation
method we define extended, centered prototype
excitation waveforms by zero-padding,

\[
\tilde{u}(m, \tau) =
\begin{cases}
v(m, \tau), & -\tfrac{1}{2}p(m) \le \tau < \tfrac{1}{2}p(m), \\
0, & \text{elsewhere},
\end{cases} \tag{4}
\]

where m takes the values 0 and K. The tilde indicates
that the extended prototypes $\tilde{u}(m, \tau)$ are centered
around the origin. This centering does not mean that
the features (e.g., the pitch pulse) of successive prototype
excitation waveforms are aligned. The prototype
waveform function $u(m, \tau)$ denotes the waveform
after proper alignment with the previous prototype
excitation waveform. The alignment procedure consists
of shifting $\tilde{u}(K, \tau)$ over a distance $\xi_K$ so as to
minimize a distortion measure $D(u(0, \tau), \tilde{u}(K, \tau - \xi))$:

\[
\xi_K = \arg\min_{\xi}\, D\bigl(u(0, \tau),\, \tilde{u}(K, \tau - \xi)\bigr). \tag{5}
\]

We then have

\[
u(K, \tau) = \tilde{u}(K, \tau - \xi_K). \tag{6}
\]
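The zero-padding and alignment steps can be sketched in discrete time as follows. This is a minimal illustration under stated assumptions: waveforms are lists of samples, shifts are whole samples, and a plain squared error stands in for the distortion measure D (the spectrally weighted measures of Subsection 2.3.1 are not used here). All function names are hypothetical.

```python
def zero_pad_centered(proto, length):
    """Discrete analogue of Eq. (4): center `proto` in a zero vector of `length` samples."""
    pad = length - len(proto)
    left = pad // 2
    return [0.0] * left + list(proto) + [0.0] * (pad - left)


def shifted(x, shift):
    """Delay x by `shift` samples (negative values advance it), filling with zeros."""
    n = len(x)
    return [x[i - shift] if 0 <= i - shift < n else 0.0 for i in range(n)]


def align(ref, proto, max_shift):
    """Discrete analogue of Eqs. (5) and (6).

    Returns the shift that minimizes the squared error against `ref`
    (an assumed stand-in for D), together with the shifted waveform.
    """
    def dist(s):
        return sum((a - b) ** 2 for a, b in zip(ref, shifted(proto, s)))

    best = min(range(-max_shift, max_shift + 1), key=dist)
    return best, shifted(proto, best)


# Illustrative use: a pulse two samples late is moved back onto the reference pulse.
ref = zero_pad_centered([0.0, 1.0, 0.0], 9)
late = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
offset, aligned = align(ref, late, 3)
```

An exhaustive search over integer shifts is the simplest choice; a practical coder would likely restrict the search range to a fraction of the pitch period and might use fractional shifts.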



To prevent divergence of the offset it is convenient
to recenter the prototype waveform associated
with pitch cycle 0 of the interpolation interval, prior
to interpolation. That is, the $t_0$ of the next interpolation
interval corresponds to $t_K + \xi_K$ of the present
interpolation interval. The definition of the offset for
pitch cycle 0,

\[
\xi_0 = 0, \tag{7}
\]

will be used from here on.

In the present blockwise interpolation method, the
intermediate excitation pitch-cycle waveforms
$u(1, \tau), \ldots, u(K-1, \tau)$ are obtained from linear
interpolation with the pitch-cycle index:

\[
u(k, \tau) = \frac{K-k}{K}\,u(0, \tau) + \frac{k}{K}\,u(K, \tau),
\qquad k = 0, 1, \ldots, K. \tag{8}
\]



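To obtain the actual pitch-cycle excitation waveform
$v(k, \tau)$, for pitch cycle k, the function $u(k, \tau)$ is
truncated to the proper length p(k) of Eq. (1):

\[
v(k, \tau) = u(k, \tau), \qquad -\tfrac{1}{2}p(k) \le \tau < \tfrac{1}{2}p(k),
\qquad k = 0, 1, \ldots, K. \tag{9}
\]

The excitation waveform x(t) is obtained by concatenation
of the truncated waveforms, starting at $t_0$:

\[
x(t) = \sum_{k} u(k, t - t_k)\,
\Pi\!\left(p(k) + \frac{\xi_K}{K},\; t - t_k - \frac{k}{K}\,\xi_K\right), \tag{10}
\]

where $\Pi(a, t)$ is a window function:

\[
\Pi(a, t) =
\begin{cases}
1, & -\tfrac{1}{2}a \le t < \tfrac{1}{2}a, \\
0, & \text{elsewhere}.
\end{cases} \tag{11}
\]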



The length of the windows for the individual pitch
cycles within Eq. (10) does not equal the pitch period,
because of the alignment shift. The simple dependence
of the window function length on $\xi_K$ results
again from the fact that the interpolation is linear
with the pitch-cycle index. If the truncation operation
can be neglected, then the procedure of zero-padding
followed by linear interpolation corresponds to linear
interpolation of the complex spectrum of the prototype
excitations, on a pitch-cycle by pitch-cycle basis.
The interpolation procedure is illustrated in Fig. 2.
In this example the left prototype excitation waveform
is a centered, band-limited impulse, and the
right prototype waveform is an offset band-limited
impulse (of lower cut-off frequency). The right prototype
has a pitch period 50% larger than that of the
left prototype. The left prototype requires a large
amount of zero-padding during the interpolation, as is
seen in Fig. 2b. However, the right prototype also requires
zero-padding, because of the offset of the impulse.
In general, the zero-padding can give rise to
discontinuities at the block boundaries, and within
the blocks at the ending of the prototype waveforms.

Because of the discontinuities, it is essential for the
blockwise methods that areas of high energy in the
excitation waveform (such as pitch pulses) are
known, such that the endpoints of the prototype waveforms
can be located where they have the least impact.
The discontinuities can be eliminated with a
simple modification of the present formalism. The
prototype waveforms $v(0, \tau)$ and $v(K, \tau)$ can be defined over an
interval somewhat longer than one pitch period, and
the square window $\Pi(a, t)$ can be replaced by an asymmetric
tapered window extending to the center of the
pitch cycles before and after the present pitch cycle.
Then Eq. (10) describes an overlap-add procedure,
which results in a smooth transition between adjacent
interpolated pitch-cycle excitation waveforms. However,
to obtain proper alignment, it is good policy to
locate the centers of the prototype waveform near
pitch pulses, even when overlap-add procedures are
used.

FIG. 2. Blockwise interpolation with fixed time scale: (a) the
interpolated excitation waveform, (b) the contribution from the
left prototype excitation waveform, and (c) the contribution of the
right prototype excitation waveform. The vertical lines indicate
the block boundaries and the horizontal bars delineate the
prototype excitation waveforms.

The implementation of the blockwise interpolation
with fixed time scale proceeds as follows. Initially, the
synthesizer is provided with p(0), p(K), $v(0, \tau)$, $v(K, \tau)$,
$t_0$, and a "desired" endpoint. First the prototypes
are aligned according to Eqs. (5) and (6). By
substituting the desired endpoint for $t_K$ in Eq. (3), a noninteger value
is obtained for K, which is rounded up to the next
integer. Then the actual $t_K$ is computed using Eq. (3).
From this the locations of all the intermediate pitch
cycles are computed using Eq. (2). Finally, the entire
excitation function is computed by using Eqs. (8) and
(10). After recentering the last prototype excitation
waveform of the present interval, the interpolation
over the next interval can proceed.
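Two of the steps above lend themselves to a short sketch: choosing the number of pitch cycles K from a desired endpoint via Eq. (3), and forming an intermediate pitch-cycle waveform via Eq. (8). The function names, the list-of-samples representation, and the guard that keeps K at least 1 are assumptions made for this illustration.

```python
import math


def choose_interval(t0, t_desired, p0, pK):
    """Solve Eq. (3) for K at the desired endpoint, round up, and recompute t_K."""
    k_real = 2.0 * (t_desired - t0) / (p0 + pK)   # generally not an integer
    K = max(1, math.ceil(k_real))                 # assumed guard: at least one cycle
    tK = t0 + 0.5 * K * (p0 + pK)                 # actual endpoint from Eq. (3)
    return K, tK


def interpolate_cycle(u0, uK, k, K):
    """Eq. (8): linear interpolation of two aligned, equal-length waveforms."""
    w = k / K
    return [(1.0 - w) * a + w * b for a, b in zip(u0, uK)]


# Illustrative values (ms): a desired 25 ms interval with periods 5 ms and 7 ms.
K, tK = choose_interval(0.0, 25.0, 5.0, 7.0)
middle = interpolate_cycle([0.0, 2.0], [4.0, 2.0], 1, 4)
```

Because K is rounded up, the actual endpoint lies at or beyond the desired one, so the next interpolation interval starts slightly later than requested.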

2.2.2. Blockwise interpolation with time scaling.
In the previous blockwise interpolation procedure we
zero-padded the excitation waveform, aligned the
futuremost prototype with the previous prototype, and
then interpolated. In the present blockwise interpolation
procedure, the prototype waveform, $v(k, \tau)$, is
considered to be one cycle of a periodic function $\tilde{u}(k, \tau)$,

\[
\tilde{u}(k, \tau) = v\bigl(k,\ \operatorname{mod}(\tau + \tfrac{1}{2}p(k),\, p(k)) - \tfrac{1}{2}p(k)\bigr), \tag{12}
\]

where the modulus function mod(a, b) = a - b int(a/b)
is used. Before alignment and interpolation of the
prototype excitation waveforms, their pitch periods,
p(0) and p(K), are normalized to unity. The normalized
(dimensionless) time scale is denoted by $\bar{\tau}$. Since
the periodic function goes through one cycle for a unity
increase in $\bar{\tau}$, we identify $2\pi\bar{\tau}$ as the pitch-cycle
phase (the phase of the fundamental harmonic). To
obtain the aligned waveform $u(K, p(K)\bar{\tau}) = \tilde{u}(K,
p(K)(\bar{\tau} - \bar{\xi}_K))$ of the time-normalized excitation
waveform we minimize a distortion criterion:

\[
\bar{\xi}_K = \arg\min_{\bar{\xi}}\, D\bigl(u(0, p(0)\bar{\tau}),\ \tilde{u}(K, p(K)(\bar{\tau} - \bar{\xi}))\bigr). \tag{13}
\]

The interpolation of the time-normalized waveforms,
followed by time denormalization by a factor p(k),
results in the interpolated excitation waveform for
pitch cycle k:

\[
u(k, \tau) = u(k, p(k)\bar{\tau})
= \frac{K-k}{K}\,u(0, p(0)\bar{\tau}) + \frac{k}{K}\,u(K, p(K)\bar{\tau}),
\qquad k = 0, 1, \ldots, K. \tag{14}
\]

The excitation waveform is obtained by concatenation
of the denormalized waveforms of the individual
pitch cycles according to Eq. (10) with $\xi_K = p(K)\,\bar{\xi}_K$.




FIG. 3. Blockwise interpolation with time scaling: (a) the
interpolated excitation waveform, (b) the contribution from the left
prototype excitation waveform, and (c) the contribution of the
right prototype excitation waveform. The vertical lines indicate
the block boundaries and the horizontal bars delineate the
prototype excitation waveforms.



The method is illustrated in Fig. 3. The prototype 
excitation waveforms are identical to those of Fig. 2.
These prototype waveforms make up one pitch cycle 
of a smooth, periodic function (Section 3 describes 
methods for extracting prototype excitation wave- 
forms with this property) . Because of the periodicity 
of the waveforms, discontinuities are now located at 
the block boundaries only. The discontinuities result 
from two causes: the discontinuities in the amplitude 
of the contributions of both prototypes, and
discontinuities in the phase of the harmonics which make up
the periodic functions describing the prototype waveforms.
The latter cause of discontinuity disappears
when the pitch period is constant, i.e., p(0) = p(K).
Thus, if the pitch period is constant, then the
discontinuities are solely due to the discrete steps in the
scaling factors of Eq. (14).

The blockwise interpolation with time scaling proceeds
similarly to that with fixed time scale. In the
frequency domain, the time scaling corresponds to
frequency scaling such that all the harmonics of the
periodic signals $\tilde{u}(k, \tau)$, k = 0, ..., K, are lined up. Thus,
this interpolation amounts to linear interpolation of
the amplitudes of the harmonics of the speech signal,
on a pitch-cycle by pitch-cycle basis.
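The periodic extension of Eq. (12) and the time-normalized interpolation of Eq. (14) can be sketched as below. This is a simplified illustration, assuming each prototype is a list of uniform samples spanning one pitch period and that the waveform is read with nearest-lower-sample lookup rather than band-limited resampling. Python's `%` operator is used for the modulus; for a positive period it returns a value in [0, p), which is what the periodic extension requires. The function names are hypothetical.

```python
def periodic_sample(proto, p, tau):
    """Eq. (12): read the periodic extension of `proto` at time tau.

    `proto` holds uniform samples of one pitch cycle on [-p/2, p/2).
    """
    n = len(proto)
    wrapped = (tau + 0.5 * p) % p        # mod(tau + p/2, p), in [0, p)
    idx = int(wrapped / p * n) % n       # nearest-lower sample (assumed lookup rule)
    return proto[idx]


def interpolated_sample(proto0, p0, protoK, pK, k, K, tau_norm):
    """Eq. (14): interpolate the two prototypes at normalized time tau_norm.

    protoK is assumed to be aligned already (Eq. (13)).
    """
    w = k / K
    left = periodic_sample(proto0, p0, p0 * tau_norm)
    right = periodic_sample(protoK, pK, pK * tau_norm)
    return (1.0 - w) * left + w * right


# Illustrative use: both prototypes are read at the same pitch-cycle phase,
# even though their pitch periods differ (4 versus 8 time units).
value = interpolated_sample([1.0, 2.0, 3.0, 4.0], 4.0,
                            [10.0, 20.0, 30.0, 40.0], 8.0, 1, 2, 0.0)
```

Reading both prototypes at the same normalized time is what lines up their harmonics, so the weighted sum interpolates harmonic amplitudes rather than raw time-domain samples at mismatched phases.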

2.2.3. Continuous interpolation. The previous
Subsection showed that blockwise interpolation with
time scaling leads to discontinuities at the block
boundaries. These discontinuities can be eliminated
by replacing the discrete pitch-cycle index k with a
continuous function of time, the continuous pitch-cycle
index $\kappa(t)$. Thus, the instantaneous pitch period
evolves now according to

\[
p(t) = p(\kappa(t)) = \frac{K - \kappa(t)}{K}\,p(t_0) + \frac{\kappa(t)}{K}\,p(t_K),
\qquad 0 \le \kappa(t) \le K, \tag{15}
\]

where p(t) represents the instantaneous pitch period
as a function of time t. Time, t, and the instantaneous
pitch-period index, $\kappa = \kappa(t)$, are related according to

\[
t = t_0 + p(t_0)\,\kappa + \bigl(p(t_K) - p(t_0)\bigr)\frac{\kappa^2}{2K},
\qquad 0 \le \kappa(t) \le K, \tag{16}
\]

which is the equivalent of Eq. (2). The inverse
relationship is

\[
\kappa(t) =
\begin{cases}
K\,\dfrac{-p(t_0) + \sqrt{p(t_0)^2 + 2\bigl(p(t_K) - p(t_0)\bigr)(t - t_0)/K}}{p(t_K) - p(t_0)},
 & p(t_K) \ne p(t_0), \\[2ex]
\dfrac{t - t_0}{p(t_0)}, & p(t_K) = p(t_0).
\end{cases} \tag{17}
\]

The length of the interpolation interval is still $t_K - t_0
= (K/2)\bigl(p(t_0) + p(t_K)\bigr)$.

The periodic function of Eq. (12) is now a continuous
function of the pitch-period index $\kappa(t)$ and is
denoted $\tilde{u}(\kappa(t), \tau)$. Thus, the two periodic functions
defining the prototype waveforms at the endpoints of
the interpolation interval $(t_0, t_K)$ are $u(0, p(t_0)\bar{\tau})$
and $\tilde{u}(K, p(t_K)\bar{\tau})$. The alignment procedure is then

\[
\bar{\xi}_K = \arg\min_{\bar{\xi}}\, D\bigl(u(0, p(t_0)\bar{\tau}),\ \tilde{u}(K, p(t_K)(\bar{\tau} - \bar{\xi}))\bigr). \tag{18}
\]

The blockwise interpolation of Eq. (14) is easily modified
to continuous interpolation. The instantaneous
excitation pitch-cycle waveform at $\kappa$ is

\[
u(\kappa, \tau) = u(\kappa, p(\kappa)\bar{\tau})
= \frac{K - \kappa}{K}\,u(0, p(t_0)\bar{\tau}) + \frac{\kappa}{K}\,u(K, p(t_K)\bar{\tau}),
\qquad 0 \le \kappa \le K. \tag{19}
\]

Equation (19) ensures continuity of the magnitude of
the waveforms over time. The excitation waveform,
x(t), on the interpolation interval is now obtained by
concatenation of infinitesimal segments of the
instantaneous excitation pitch-cycle waveforms. Consider a
concatenation of sections of length $dt = p(\kappa)\,d\kappa$. Over
this infinitesimal interval, the pitch-cycle phase
increases by $2\pi\,d\bar{\tau} = (2\pi/p(t))\,dt$. This shows that,
for a signal with continuous phase, we have that
$\bar{\tau} = \kappa$ + constant. Setting the constant equal to zero
we have

\[
x(t) = u\bigl(\kappa(t),\ p(\kappa(t))\,\kappa(t)\bigr). \tag{20}
\]



In Eq. (20) the first argument determines the excitation
waveform, and the second argument determines
the point of this waveform to be used at x(t). It is
convenient to choose the interpolation interval to be
an integer number of pitch periods. An example of the
resulting interpolation is shown in Fig. 4. The time
scale of the figure and the prototype excitation waveforms
are identical to those of Figs. 2 and 3. Figure 4
displays an integer number of pitch cycles, starting
from the center of the left prototype waveform. For
the case shown where K = 4, additional prototype
waveforms must be known for interpolation outside
this range.

Thus, the continuous interpolation algorithm proceeds
as follows. The synthesizer is given $p(t_0)$,
$p(t_K)$, $v(0, \tau)$, $v(K, \tau)$, $t_0$, and a "desired" endpoint.
First alignment is obtained with Eqs. (18) and
(6) (again, $\xi_K = p(t_K)\,\bar{\xi}_K$). Equation (17) is used to
compute $\kappa$ at the desired endpoint. If an integer K is desired, then
this value is rounded to an integer, K, and $t_K$ is
obtained with Eq. (16). To create a sampled output signal,
we compute $\kappa$ for each sample point with Eq.
(17). Then we compute the appropriate point of the
instantaneous excitation pitch-cycle waveform with
(20) and (19). The continuous interpolation described
by Eqs. (16)-(20) corresponds to continuous,
linear interpolation of the amplitudes of the harmonics
of the excitation signal.
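The per-sample computation can be sketched as follows. The inverse mapping in `kappa_from_time` is obtained here by solving the quadratic of Eq. (16) for the pitch-cycle index, with a separate branch for constant pitch. The names are hypothetical, and the two prototypes are assumed to be supplied as already aligned callables of normalized time with period 1.

```python
import math


def time_from_kappa(kappa, t0, K, p0, pK):
    """Eq. (16): time as a function of the continuous pitch-cycle index."""
    return t0 + p0 * kappa + (pK - p0) * kappa ** 2 / (2 * K)


def kappa_from_time(t, t0, K, p0, pK):
    """Inverse of Eq. (16), i.e. the role played by Eq. (17)."""
    d = pK - p0
    if abs(d) < 1e-12:                    # constant pitch: index grows linearly
        return (t - t0) / p0
    root = math.sqrt(p0 * p0 + 2.0 * d * (t - t0) / K)
    return K * (root - p0) / d


def excitation_sample(t, t0, K, p0, pK, cycle0, cycleK):
    """Eqs. (19) and (20): one sample of the continuously interpolated excitation.

    cycle0 and cycleK are assumed to be aligned prototypes expressed as
    functions of normalized time (period 1).
    """
    kappa = kappa_from_time(t, t0, K, p0, pK)
    w = kappa / K
    # With the integration constant set to zero, normalized time equals kappa.
    return (1.0 - w) * cycle0(kappa) + w * cycleK(kappa)


def cosine_cycle(tau_norm):
    """Illustrative prototype: a single harmonic."""
    return math.cos(2.0 * math.pi * tau_norm)


# Illustrative use: sample the excitation every 0.125 ms over a 25 ms interval.
samples = [excitation_sample(0.125 * n, 0.0, 4, 5.0, 7.5, cosine_cycle, cosine_cycle)
           for n in range(200)]
```

Since the phase is driven by the pitch-cycle index rather than by time directly, the output has no phase jumps at cycle boundaries even when the pitch period changes across the interval.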

2.3. Adjustment of the Degree of Periodicity

Usage of the basic PWI discussed in the previous
subsection produces excellent quality for speech sections
which are highly periodic. However, in other
speech signals the reconstructed signal may at times
sound somewhat buzzy or reverberant. These distortions
are the result of too much periodicity and periodic
noise, respectively. In this section we will discuss
how additional features of the PWI coder can eliminate
these distortions. To this purpose, it is necessary
to first define appropriate distortion measures for the
prototype waveforms. These criteria can also be used
for the alignment procedures necessary for interpolation,
as described in Subsection 2.2, and the quantization
process, which will be discussed in Section 3.

FIG. 4. Continuous interpolation: (a) the interpolated excitation
waveform, (b) the contribution from the left prototype excitation
waveform, and (c) the contribution of the right prototype
excitation waveform. The prototype excitation waveforms are shown
in (d).

2.3.1. Definition of a distortion measure for prototype
waveforms. Consider two excitation waveforms,
as defined by Eq. (4), $u(k, \tau)$ and $v(l, \tau)$, and
pitch period p(k) and p(l). Further, let $h(\tau)$ denote
the impulse response of a weighting filter. The function
of this filter is to spectrally weight the excitations,
such that the difference between the filtered
prototype excitation waveforms is perceptually relevant.
The filter adds the formant structure of the
speech signal, but in a deemphasized form. (The
deemphasis reflects spectral masking of the human
auditory system [13].) The spectral weighting is similar
to the weighting used during the search of the
codebook in the CELP algorithm [1].

We now define two useful distortion measures between
the spectrally weighted excitation waveforms.
The first distortion measure is invariant with the
energy of the individual waveforms, while the second
is not. Consider the following normalized error energy
between $u(k, \tau)$ and $v(l, \tau)$ ($\lambda$ is an as yet
undefined scaling factor),

\[
D_\lambda(u(\tau), v(\tau)) =
\frac{\int_{-\infty}^{\infty} \bigl(h(\tau) * u(\tau) - \lambda\, h(\tau) * v(\tau)\bigr)^2\, d\tau}
     {\int_{-\infty}^{\infty} \bigl(h(\tau) * u(\tau)\bigr)^2\, d\tau}, \tag{21}
\]

where * denotes convolution. For $\lambda = 1$ this equation
provides the second distortion measure. The distortion
measure invariant with the energies of the vectors
is obtained by determining the value for $\lambda$ that
minimizes the distortion measure of Eq. (21). In that
case we obtain

\[
D_{\mathrm{opt}}(u(k, \tau), v(l, \tau)) = 1 -
\frac{\Bigl(\int_{-\infty}^{\infty} \bigl(h(\tau) * u(k, \tau)\bigr)\bigl(h(\tau) * v(l, \tau)\bigr)\, d\tau\Bigr)^2}
     {\int_{-\infty}^{\infty} \bigl(h(\tau) * u(k, \tau)\bigr)^2 d\tau\ \int_{-\infty}^{\infty} \bigl(h(\tau) * v(l, \tau)\bigr)^2 d\tau}. \tag{22}
\]

Note that Eq. (22) is symmetric in $u(k, \tau)$ and $v(l, \tau)$;
it provides a normalized, energy-invariant, and
symmetric measure for the differences between these
two excitation waveforms.

The definition of the distortion measures of Eqs.
(21) and (22) was aimed mainly at the nonperiodic
waveforms of Eq. (4). For the case that the prototype
excitation waveforms are assumed to be periodic (as
in Eq. (12)), and are time-scaled during interpolation,
it is useful to define distortion criteria which
operate on the difference between excitation waveforms
of normalized pitch. Because the waveforms are
periodic, it is convenient to evaluate the distortion
measure in the frequency domain. It is then also possible
to restrict the criterion to a particular frequency
band, or to add frequency-domain weighting. Let
$H(\omega)$ be the Fourier transform of $h(\tau)$ and let $U_m(k)$
and $V_m(l)$ be the complex Fourier series coefficients
of $u(k, \tau)$ and $v(l, \tau)$, respectively. In general, the
pitch periods of $u(k, \tau)$ and $v(l, \tau)$, denoted by p(k)
and p(l), are not identical. The energy-invariant
distortion measure is

\[
\tilde{D}_{\mathrm{opt}}(u(k, \tau), v(l, \tau), m_0, m_1) = 1 -
\frac{\Bigl(\operatorname{Re} \sum_{m=m_0}^{m_1} \bigl|H(2\pi m/\bar{p})\bigr|^2\, U_m(k)\, V_m^{*}(l)\Bigr)^2}
     {\sum_{m=m_0}^{m_1} \bigl|H(2\pi m/\bar{p})\bigr|^2\, |U_m(k)|^2\ \sum_{m=m_0}^{m_1} \bigl|H(2\pi m/\bar{p})\bigr|^2\, |V_m(l)|^2}, \tag{23}
\]

where the tilde indicates that we are dealing with
pitch-normalized periodic waveforms, and where $\bar{p}$ is
an appropriate time scaling for the filter.
$\tilde{D}_1(u(k, \tau), v(l, \tau), m_0, m_1)$ is defined similarly. Equation (23)
compares the prototype waveforms on a harmonic-by-harmonic
basis, starting with harmonic $m_0$ and
ending with harmonic $m_1$. By setting the time scaling
of the weighting filter to $\bar{p}$, we have effectively
time-scaled the pitch periods of both prototype excitation
waveforms to this value, and then applied the spectral
weighting. Thus, if $u(k, \tau)$ is the unquantized residual,
and $v(l, \tau)$ is some codebook entry, which is evaluated
for its match, then it is best to choose $\bar{p} = p(k)$. If
we want to compare two excitation waveforms at different
times in the same residual signal (as in Subsection
2.3.2), then it is reasonable to choose $\bar{p} = (p(k)
+ p(l))/2$ when p(k) and p(l) are close (the comparison
will become less useful when p(k) and p(l) are
very different).
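A discrete sketch of the two time-domain distortion measures is given below. It is a simplified illustration: the waveforms are equal-length lists of samples and the weighting filter is omitted (equivalent to assuming an impulse for the weighting response), so the result is not the perceptually weighted measure of the text. The function name and the `lam` parameter are hypothetical.

```python
def distortion(u, v, lam=None):
    """Discrete, unweighted analogue of the normalized error energy.

    With lam=None the scaling factor is chosen optimally, giving the
    energy-invariant, symmetric form 1 - <u,v>^2 / (<u,u> <v,v>).
    With a numeric lam (for example 1.0) the error energy of u - lam*v
    is normalized by the energy of u.
    """
    euu = sum(a * a for a in u)
    evv = sum(b * b for b in v)
    euv = sum(a * b for a, b in zip(u, v))
    if lam is None:
        return 1.0 - (euv * euv) / (euu * evv)
    return sum((a - lam * b) ** 2 for a, b in zip(u, v)) / euu


# Illustrative use: a scaled copy has zero energy-invariant distortion,
# but a nonzero distortion when the scaling factor is fixed at 1.
same_shape = distortion([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
fixed_gain = distortion([1.0, 2.0, 3.0], [2.0, 4.0, 6.0], lam=1.0)
```

The optimal scaling factor is the ratio of the cross term to the energy of v; substituting it into the normalized error energy yields the closed form used in the `lam=None` branch.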



2.3.2. The signal-to-change ratio. In the context
of the PWI coding method, it is convenient to define a
measure of the degree of periodicity, the signal-to-change
ratio (SCR). The SCR is simply the inverse of
the distortion between two prototype waveforms of
the same signal, as measured with the distortion measure
of Eq. (22) or (23).

For coding purposes, we describe the periodicity of
the speech signal with a long-term and a short-term
SCR. We define the long-term SCR as the SCR between
prototype waveforms separated by 20-30 ms.
This time interval is made to coincide with the update
rate of the coding system (the rate at which a prototype
is extracted from the speech signal). The short-term
SCR is defined as the SCR between successive
pitch cycles.
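As a minimal sketch of these definitions, the SCR can be computed as the inverse of an energy-invariant distortion, and the short-term SCR as that quantity evaluated over successive pitch cycles. The sketch assumes equal-length, already aligned pitch cycles given as lists of samples and omits the spectral weighting; the function names are hypothetical.

```python
import math


def scr(cycle_a, cycle_b):
    """Signal-to-change ratio: inverse of the energy-invariant distortion.

    The distortion used here is the unweighted discrete form
    1 - <a,b>^2 / (<a,a> <b,b>), an assumed simplification.
    """
    eaa = sum(x * x for x in cycle_a)
    ebb = sum(y * y for y in cycle_b)
    eab = sum(x * y for x, y in zip(cycle_a, cycle_b))
    d = 1.0 - (eab * eab) / (eaa * ebb)
    return math.inf if d <= 0.0 else 1.0 / d


def short_term_scr(cycles):
    """SCR between each pair of successive pitch cycles."""
    return [scr(a, b) for a, b in zip(cycles, cycles[1:])]


# Illustrative use: three slowly evolving pitch cycles.
values = short_term_scr([[1.0, 0.0, 0.0], [1.0, 0.1, 0.0], [1.0, 0.2, 0.1]])
```

A long-term SCR would apply the same `scr` function to prototypes taken one update interval apart instead of to adjacent cycles.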

In PWI, lowering of the long-term SCR for voiced 
speech segments results in a reverberant quality. If 
both short-term and long-term SCR are too high, a 
buzzy quality emerges. A short-term SCR lower than
that of the original signal results in a noisy character.
Thus, maintaining the SCR of the original signal is
essential for good speech quality. When the speech
signal is highly periodic, and the prototype waveforms
are not quantized, the interpolation methods
discussed in Subsection 2.2 will do this, resulting in
high-quality speech. 

By adjusting the magnitude of the difference in 
waveform shapes between successively transmitted 
prototypes, the long-term SCR can be controlled. The 
SCR on the original prototypes (which would be used 
in an unquantized system) is measured, and the SCR 
of the quantized system is constrained to be not larger 
than this value. Constraining the long-term SCR re- 
sults in a larger difference between original and re- 
constructed speech waveforms, but the reconstructed 
speech increases in perceptual quality, by removing 
the reverberation. 

When constraining the SCR of reconstructed
speech, it is important to consider the effect on the
individual frequency bands separately. If a single
SCR for the entire speech bandwidth is constrained to
be identical to that of the original, then the SCR for
the higher frequency bands is usually too high. As a
result, the suppression of reverberation by maintaining
a constant long-term SCR over the entire speech
bandwidth will often result in a buzzy quality. It is
better to suppress reverberation separately in several
frequency bands. Note that this is particularly convenient
if the extended prototype is assumed to be periodic
and Eq. (23) can be used.

For speakers with low fundamental frequency, i.e.,
with a pitch period near, or exceeding, 20 ms, the original
long-term SCR and the prototype description
provide sufficient information for high-quality speech
synthesis. However, for speakers with a shorter pitch
period, the interpolation procedure may introduce too
much short-term periodicity. For these speakers (a
majority), the short-term SCR of the reconstructed
speech is larger than that of the original signal. The
short-term SCR can be corrected by adding noise to
the reconstructed signal. This simple addition of
white noise to the excitation signal is feasible since
the noise component of the original signal is generated
by turbulence and has no structure.

In practice, maintaining the short-term SCR such
that a perceptually good speech quality results is relatively
straightforward. Thus, it was found experimentally
that, for voiced speech, the noise contribution
can be kept constant without audible distortion for
most speakers, and only small distortions for the
remaining speakers. For this result, a small amount of
noise, increasing with frequency, is injected at
frequencies beyond 2 kHz.

3. IMPLEMENTATION OF THE PWI 
CODING ALGORITHM 

In this section, practical implementations of the 
PWI algorithm are discussed. We limit our discussion 
to the critical issues of extraction, quantization, and 
interpolation of the prototype waveforms.

3.1. Extraction of Prototype Excitation 
Waveforms 

Two methods for extracting the prototype wave-
form from the original signal are reported in this sec-
tion. Both procedures use an initial pitch estimate.
We obtained these estimates using the modified auto-
correlation algorithm [14, 15]. The first extraction
procedure is a development of the pitch-marker algo-
rithm used previously in single-pulse excitation [9],
while the second procedure searches for a pitch cycle
using a new, so-called maximum-prediction-gain cri-
terion. Both prototype extraction methods require
comparable computational effort and have similar ro-
bustness. If estimates concerning the level of the SCR
of the original speech are required, then the algorithm
based on the maximum-prediction-gain criterion pro-
vides the added advantage of high temporal resolu-
tion.

3.1.1. Prototype excitation waveform extraction 
based on pitch markers. The purpose of a pitch-
marker algorithm is to detect the beginning of each
pitch cycle in voiced speech. Thus, two adjacent pitch
markers provide the boundaries of a local pitch cycle. 
The prototype excitation waveform is extracted with 
a time resolution equal to that of the sampling rate of 
the sampled signal. 

The pitch-marker algorithm proposed in [9] is
based on the fact that good-quality periodic speech
can be obtained by exciting an LP synthesis filter
with one delta impulse for each pitch cycle. The loca-
tions of these delta impulses are the pitch markers.
They are determined using a dynamic programming
framework employing a cost function that combines
several perceptually important criteria, including the
mean-squared error between original and recon-
structed speech, smoothness of successive pulse am-
plitudes and pulse intervals, and deviations of the
pulse intervals from an initial average pitch estimate.
To find the pitch markers, the accumulated cost is
minimized over an interval of around 30 ms in dura-
tion.

In our discussion of the pitch-marker-based extrac-
tion we will focus on a method which is aimed mainly
at operation in conjunction with the definition of the
extended prototype through zero-padding (Eq. (4)).
When using zero-padded prototype waveforms, it is
essential to minimize the discontinuities during the
interpolation of the prototype excitation waveforms.
Thus, the prototype waveforms must be extracted
such that the pitch marker is located sufficiently far
from its endpoints.

We now discuss a consistent manner of extracting
the prototype excitation waveforms. Figure 5 illus-
trates the definition of prototype excitation wave-
forms based on pitch markers obtained with the
aforementioned method. The speech signal is parti-
tioned into frames of equal length. (At an 8-kHz sam-
pling rate a typical frame length is 200 samples.) Let
us denote the pitch-marker locations in sampling
units within the current frame as $n_1, \ldots, n_K$ and the
first pitch-marker location in the next frame as $n_{K+1}$.
(Note that the subscripts refer here to the pitch
markers and are not identical to the pitch-cycle index
of the reconstructed signal.) For each frame, we de-
fine a prototype excitation waveform for an interval
of length $p(K)$ around pitch marker $n_K$ that is limited
by the midpoints between the pitch markers at $n_{K-1}$
and $n_K$, and the pitch markers at $n_K$ and $n_{K+1}$:

$$p(K) = \tfrac{1}{2}(n_K - n_{K-1}) + \tfrac{1}{2}(n_{K+1} - n_K) = \tfrac{1}{2}(n_{K+1} - n_{K-1}). \tag{24}$$

The sampled, unquantized prototype excitation wave-
form $v(K, nT)$, where $T$ is the sampling period, is
obtained by multiplying the LP residual $e(nT)$ with a
rectangular window of length $p(K)$,

$$v(K, nT) = e\bigl(nT + \tfrac{1}{2}(p(K) + n_K + n_{K-1})T\bigr)\,\Pi(p(K), nT), \tag{25}$$

with $\Pi(p(K), nT)$ as defined in Eq. (11).
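The extraction step of Eqs. (24) and (25) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `extract_prototype` is hypothetical, the residual is a plain list of samples, and the midpoints are rounded down to integer sample indices (the text does not specify how half-sample midpoints are handled).

```python
def extract_prototype(residual, n_prev, n_cur, n_next):
    """Cut one prototype cycle around pitch marker n_cur.

    residual : LP residual samples e(nT)
    n_prev, n_cur, n_next : pitch markers n_{K-1}, n_K, n_{K+1} (sample indices)
    Returns (p, segment), with p the interval length of Eq. (24).
    """
    # Eq. (24): the interval is bounded by the midpoints between markers.
    p = 0.5 * (n_next - n_prev)
    start = (n_prev + n_cur) // 2   # left midpoint (rounded down, an assumption)
    stop = (n_cur + n_next) // 2    # right midpoint (rounded down, an assumption)
    # Eq. (25): a rectangular window applied to the residual.
    return p, residual[start:stop]


# Example: markers at samples 40, 100, and 164 of a 200-sample frame.
p, segment = extract_prototype(list(range(200)), 40, 100, 164)
```

With these markers the segment spans samples 70 to 131, so the pitch marker sits well inside the window, as the zero-padding discussion above requires.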

The pitch-marker method can also be used to ob-
tain prototype waveforms which can be extended peri-
odically (as in Eq. (12)). If the square window of Eq.
(25) is replaced by an asymmetric, tapered window,
extending from $n_{K-1}$ to $n_{K+1}$, with its maximum at $n_K$,
then the Fourier transforms of the resulting extended,
windowed prototype waveforms can be used to define
a periodic prototype waveform.

3.1.2. Prototype excitation waveform extraction
based on maximizing the prediction gain. The pres-
ent procedure extracts a pitch-cycle waveform, which
is continuous when extended periodically (Eq. (12)).
For satisfactory results, it is necessary to increase the
time resolution beyond the usual 8-kHz sampling rate
of the speech signal. Similar benefits of increased reso-
lution were earlier observed for the pitch description
of CELP [16]. However, in contrast to current imple-
mentations of CELP, the PWI method requires the
precise pitch period only for obtaining an accurate
description of the prototype and the degree of peri-
odicity; a coarsely quantized version can be used for
transmission.

The waveform is extracted with what will be re-
ferred to as the maximum-prediction-gain criterion:
given a starting point of a speech interval, find the
interval length which results in maximum short-term
prediction gain of the periodic signal obtained by re-
peating the interval. In general, if one repeats an in-
terval of arbitrary length, a discontinuity will exist at
the boundary points (where the left side of the origi-
nal interval meets the right side), as is illustrated in
Fig. 6. The periodic signal cannot be predicted across
these boundaries, and the prediction gain will be
lower as a result. However, when the interval length is
chosen to be equal to the pitch period, i.e., if the peri-
odicity of the signal equals that of the original speech
signal, then the periodic signal can be predicted
across these boundaries, and the prediction gain will
be high.
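The criterion can be illustrated with a simplified Python sketch. It is restricted to integer candidate lengths, whereas the procedure described in this subsection works at a finer time resolution; the function names, the predictor order, and the numerical floor on the error energy are assumptions made for the illustration only.

```python
import math


def prediction_gain(cycle, order):
    """Short-term prediction gain of the signal obtained by repeating `cycle`."""
    n = len(cycle)
    # Circular autocorrelation equals the autocorrelation of the periodic extension.
    r = [sum(cycle[i] * cycle[(i + k) % n] for i in range(n))
         for k in range(order + 1)]
    if r[0] <= 0.0:
        return 0.0
    a = [0.0] * (order + 1)
    err = r[0]
    floor = 1e-12 * r[0]            # guards against a vanishing error energy
    # Levinson-Durbin recursion for the prediction-error energy.
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
        if err <= floor:
            err = floor
            break
    return r[0] / err


def best_cycle_length(signal, start, lengths, order=4):
    """Candidate length whose periodic repetition is most predictable."""
    return max(lengths,
               key=lambda length: prediction_gain(signal[start:start + length], order))


# Synthetic periodic test signal with a period of 40 samples.
x = [math.sin(2 * math.pi * 3 * n / 40) + 0.5 * math.sin(2 * math.pi * 7 * n / 40 + 0.3)
     for n in range(200)]
best = best_cycle_length(x, 10, range(30, 51))
```

For any candidate length other than the true period, the wrap-around point introduces a discontinuity that the predictor cannot remove, so the gain drops.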

One method for implementing the maximum-pre-
diction-gain criterion is to obtain a band-limited
Fourier series for each candidate interval. Let $p$ de-
note the candidate interval length in sampling-period
units ($p$ is, in general, not integer). A periodic signal,
band-limited at half the sampling frequency, can be
represented by a Fourier series with $M$ harmonics,
where $M \le p/2 < M + 1$. Thus, we can fit exactly the
Fourier series to the speech samples if we center the
trial interval of length $p$ at an existing sample. The
procedure is best interpreted as a band-limited inter-
polation for a nonuniformly sampled periodic signal
(the sampling pattern being the same for each period,
consisting of $2M$ identical sampling intervals, and one
different sampling interval). Upon obtaining a Four-
ier series of $M$ harmonics, the autocorrelation func-
tion can be computed for all desired lags. Given these
autocorrelation values, the short-term predictor coef-
ficients for a periodic signal sampled at the original
sampling rate can be computed. From these predictor
coefficients the prediction gain can be obtained.
Other, more efficient methods of pitch-cycle extrac-
tion using the maximum-prediction-gain criterion
will be discussed elsewhere.

FIG. 5. Extraction of prototype excitation waveforms based on pitch markers.

FIG. 6. The maximum-prediction-gain criterion: (a) original
speech signal, (b) repetition of an arbitrary speech segment, and
(c) repetition of exactly one pitch period.

The maximum-prediction-gain procedure results in
a continuous, periodic, band-limited signal, of pitch
period $p(K)$, represented by a finite Fourier series.
Let us denote the extracted prototype $s(\tau)$ as

$$s(\tau) = \sum_{m=0}^{M}\left[A_m \cos\left(\frac{2\pi m\tau}{p(K)}\right) + B_m \sin\left(\frac{2\pi m\tau}{p(K)}\right)\right], \tag{26}$$

where we included the parameter $B_0$ for convenience
of notation only. The number of harmonics $M$ is a
function of the pitch period, $p(K)$, and the cut-off
frequency (usually the Nyquist frequency of the sam-
pled speech signal).

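The prototype excitation waveform is obtained by
filtering a sampled version of the periodic speech sig-
nal (assumed to be band-limited) with the digital LP
filter with filter coefficients $a_0, \ldots, a_P$, followed
by ideal low-pass filtering at the Nyquist frequency,

$$\tilde{u}(\tau) = \sum_{n=0}^{P} a_n \sum_{m=0}^{M}\left[A_m \cos\left(\frac{2\pi m(\tau - nT)}{p(K)}\right) + B_m \sin\left(\frac{2\pi m(\tau - nT)}{p(K)}\right)\right], \tag{27}$$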

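where $T$ is the sampling interval (used for the digital
filtering). Equation (27) implies that, if $s(\tau)$ repre-
sents the speech waveform at the update time $t_K$,
then the Fourier series coefficients of the unaligned
excitation waveform, at the endpoint of the interpola-
tion interval, $t_K$, are given by

$$\tilde{C}_m = \sum_{n=0}^{P} a_n\left[A_m \cos\left(\frac{2\pi mnT}{p(K)}\right) - B_m \sin\left(\frac{2\pi mnT}{p(K)}\right)\right],$$
$$\tilde{D}_m = \sum_{n=0}^{P} a_n\left[A_m \sin\left(\frac{2\pi mnT}{p(K)}\right) + B_m \cos\left(\frac{2\pi mnT}{p(K)}\right)\right], \tag{28}$$

where $\tilde{C}_m$ and $\tilde{D}_m$ are the coefficients for the cosine
and sine basis functions, respectively. The tilde indi-
cates that the prototype excitation waveform is not
aligned with the previous prototype excitation wave-
form. The extended prototype excitation waveform
can be written as

$$\tilde{u}(K, \tau) = \sum_{m=0}^{M}\left[\tilde{C}_m(K) \cos\left(\frac{2\pi m\tau}{p(K)}\right) + \tilde{D}_m(K) \sin\left(\frac{2\pi m\tau}{p(K)}\right)\right], \tag{29}$$

where the dependence of the Fourier series coeffi-
cients on the pitch-cycle index was made explicit by
giving them the argument $k = K$.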
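The mapping from speech-domain to excitation-domain Fourier coefficients can be checked numerically. The sketch below assumes that the LP analysis filter is the FIR filter $\sum_n a_n z^{-n}$ applied to the periodic signal; the function names and the test values are hypothetical. It applies the filter directly to the coefficients and compares the result with filtering in the time domain.

```python
import math


def lp_filter_coeffs(A, B, a, p, T=1.0):
    """Fourier coefficients (C, D) of the LP-filtered periodic signal."""
    C, D = [], []
    for m, (Am, Bm) in enumerate(zip(A, B)):
        # Cosine and sine sums of the filter taps at harmonic m.
        re = sum(an * math.cos(2 * math.pi * m * n * T / p) for n, an in enumerate(a))
        im = sum(an * math.sin(2 * math.pi * m * n * T / p) for n, an in enumerate(a))
        C.append(Am * re - Bm * im)
        D.append(Am * im + Bm * re)
    return C, D


def eval_series(C, D, p, tau):
    """Evaluate a Fourier series with period p at time tau."""
    return sum(c * math.cos(2 * math.pi * m * tau / p)
               + d * math.sin(2 * math.pi * m * tau / p)
               for m, (c, d) in enumerate(zip(C, D)))


# Hypothetical prototype and filter taps, for illustration only.
A, B = [0.1, 1.0, 0.5], [0.0, 0.3, -0.2]
a, p, tau = [1.0, -0.9, 0.2], 7.3, 2.1
C, D = lp_filter_coeffs(A, B, a, p)
direct = sum(an * eval_series(A, B, p, tau - n) for n, an in enumerate(a))
via_coeffs = eval_series(C, D, p, tau)
```

Both routes give the same sample value, which is why the excitation prototype can be obtained without leaving the Fourier-series description.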

3.2. Interpolation of the Prototype Waveform 

3.2.1. Interpolation with sampled time-domain
description. The sampled time-domain method is
aimed at good performance at a minimal computation
cost. The time-domain resolution is limited to 8 kHz
during the entire interpolation process. The proto-
type excitation waveform is described by discrete sam-
ples obtained with the extraction method of Subsec-
tion 3.1.1.

The first step upon obtaining the sampled proto-
type excitation waveforms is their alignment. An ef-
fective procedure for time alignment of the sampled
prototype excitation waveforms takes advantage of
the implicit knowledge of the location of the pitch
markers. The present prototype excitation waveform
$\tilde{u}(K, nT)$ is shifted according to Eq. (6) such that its
pitch-marker position is aligned with the one in the
(recentered) previous prototype excitation waveform
$u(0, nT)$. Then Eq. (5) reduces to

(30)

where $n_0$ and $n_K$ are the pitch-marker locations with
respect to the center of the window of Eq. (25).

With this specification of the alignment procedure, 
interpolation of the sampled prototype excitation 
waveforms is fully described by Eqs. (1)-(11) of Sec-
tion 2. In natural speech, the number of harmonics
occasionally doubles or halves. To prevent unnatural 
warps of the pitch period in these cases, the smaller 
pitch period is repeated an integer number of times, 
such that it most closely matches the length of the 
larger pitch period. 
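The repetition rule for pitch doubling or halving can be written as a one-line helper. The sketch below is only an illustration; the function name is hypothetical and the cycle is assumed to be a list of samples.

```python
def repeat_to_match(short_cycle, long_length):
    """Repeat the shorter pitch cycle so its length best matches long_length."""
    # Integer number of repetitions closest to the length ratio, at least one.
    reps = max(1, round(long_length / len(short_cycle)))
    return list(short_cycle) * reps


# A 30-sample cycle next to a 61-sample cycle is repeated twice (60 samples),
# so the interpolation warps the period by 1 sample instead of 31.
doubled = repeat_to_match(list(range(30)), 61)
```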

We now define in more detail the distortion crite-
ria, which are used for adjustment of the SCR, and the
quantization process (and as an alternative for the
alignment procedure of Eq. (30)). The distortion cri-
teria of Eqs. (21) and (22) are appropriate for the
time-domain interpolation of sampled waveforms. To
describe these distortion criteria for the discrete sig-
nal it is advantageous to approximate the all-pole LP-
synthesis filter by its truncated impulse response $h_0$,
$h_1, \ldots, h_R$. (A value of around 25 suffices for $R$.) To
take into account the spectral masking of the human
auditory system, this impulse response is modified by
the perceptual weighting factor $\gamma$:
$h_0, \gamma h_1, \gamma^2 h_2, \ldots, \gamma^R h_R$. (The percep-
tual weighting factor usually has a value of around
0.8.) The perceptually weighted response $y$ to a vector
$u$ describing a sampled version $u(k, nT)$ of the zero-
padded continuous prototype excitation waveform is



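$$y = Hu, \tag{31}$$

where $H$ is

$$H = \begin{bmatrix}
h_0 & 0 & \cdots & \cdots & 0 \\
\gamma h_1 & h_0 & \ddots & & \vdots \\
\vdots & \gamma h_1 & \ddots & \ddots & \vdots \\
\gamma^R h_R & \vdots & \ddots & h_0 & 0 \\
0 & \gamma^R h_R & \cdots & \gamma h_1 & h_0
\end{bmatrix}. \tag{32}$$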



Thus, Eqs. (21) and (22), expressed in the time do-
main, become for the sampled time-domain descrip-
tion ($v$ describes $v(l, nT)$)

$$D_1(u(k, nT), v(l, nT)) = (u - v)^T H^T H (u - v), \tag{33}$$

$$D_2(u(k, nT), v(l, nT)) = 1 - \frac{(u^T H^T H v)^2}{u^T H^T H u \; v^T H^T H v}. \tag{34}$$



If $u$ is considered to be the target excitation [4], and $v$
a candidate excitation vector from the codebook, then
the distortion measure of Eq. (34) is of the same form
as that used in the quantization of the excitation
function in CELP (note that $u^T H^T H u$ is a constant).
However, the interpretation of the target excitation
differs. In the CELP algorithm, the continuous up-
date allows one to correct errors in the zero-input re-
sponse in the current frame of the reconstructed sig-
nal (the target excitation waveform is not equal to the
residual signal). In the PWI method, the target exci-
tation is identical to the actual excitation, because we
consider only one pitch cycle per update interval and
correction for previous quantization errors is not ap-
plicable.
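The weighted time-domain distortion measures can be sketched as follows. Several points are assumptions made for the illustration: the LP polynomial is taken as $A(z) = 1 + \sum_k a_k z^{-k}$, the matrix product $Hu$ is computed as a convolution truncated to the length of $u$, the squared-error measure is unnormalized, and the function names `d1` and `d2` are hypothetical labels for the squared-error and normalized-correlation criteria.

```python
def weighted_impulse_response(a, gamma=0.8, R=25):
    """Truncated impulse response of 1/A(z), weighted by gamma**n.

    Assumed convention: A(z) = 1 + a[0] z^-1 + a[1] z^-2 + ...
    """
    h = []
    for n in range(R + 1):
        acc = 1.0 if n == 0 else 0.0
        for k in range(1, min(n, len(a)) + 1):
            acc -= a[k - 1] * h[n - k]
        h.append(acc)
    return [(gamma ** n) * hn for n, hn in enumerate(h)]


def weighted_response(h, u):
    """y = H u, with H lower-triangular Toeplitz, truncated to len(u)."""
    return [sum(h[k] * u[n - k] for k in range(min(n, len(h) - 1) + 1))
            for n in range(len(u))]


def d1(h, u, v):
    """Weighted squared error between u and v."""
    y = weighted_response(h, [ui - vi for ui, vi in zip(u, v)])
    return sum(t * t for t in y)


def d2(h, u, v):
    """One minus the squared normalized weighted correlation of u and v."""
    yu, yv = weighted_response(h, u), weighted_response(h, v)
    cross = sum(s * t for s, t in zip(yu, yv))
    return 1.0 - cross * cross / (sum(s * s for s in yu) * sum(t * t for t in yv))


h = weighted_impulse_response([-0.9], gamma=0.8, R=3)   # single-pole example
u = [0.0, 1.0, -0.5, 0.25, 0.0, 0.1]
scaled = [2.0 * x for x in u]
```

The correlation-based measure is insensitive to a gain applied to the candidate, which is what makes it suitable for a gain-shape codebook search.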

3.2.2. Prototype waveform interpolation based on
the Fourier series description. We now discuss a
practical implementation of the continuous interpola-
tion described in Subsection 2.2.3. The periodic wave-
forms are described with a Fourier series which can be
obtained with the procedures discussed earlier. In the
implementation of the continuous interpolation
method, we take advantage of the fact that linear in-
terpolation of the shape of the prototype excitation is
equivalent to a linear interpolation of their Fourier
series coefficients.

If the pitch period has changed over the interpola-
tion interval, the number of harmonics of the proto-
types representing the endpoints may not be equal.
For the prototype with the lesser number of harmon-
ics, the "missing" harmonics have moved beyond the
Nyquist frequency and have been removed by the
anti-aliasing filter of the analog-to-digital converter.
To facilitate interpolation, harmonics of zero ampli-
tude are added to the prototype with the lesser num-
ber of harmonics. To prevent unnatural warps of the
fundamental frequency during pitch doubling or pitch
halving, the prototype waveform with the smaller
pitch period is again repeated an integer number of
times, such that it most closely matches the length of
the larger pitch period. In the present method, this is
equivalent to interspacing zero-amplitude harmonics
between the original harmonics for the prototype
with fewer harmonics.
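Both coefficient manipulations are simple list operations. The sketch below is illustrative only; the function names are hypothetical and each list holds one set of Fourier coefficients (cosine or sine) indexed by harmonic number, starting at zero.

```python
def pad_harmonics(coeffs, count):
    """Append zero-amplitude harmonics so that the list holds `count` entries."""
    return list(coeffs) + [0.0] * (count - len(coeffs))


def interspace_harmonics(coeffs, reps):
    """Coefficients of `reps` repetitions of a cycle, viewed as a single cycle.

    Harmonic m of period p becomes harmonic m * reps of period reps * p;
    all harmonics in between have zero amplitude.
    """
    out = [0.0] * ((len(coeffs) - 1) * reps + 1)
    for m, c in enumerate(coeffs):
        out[m * reps] = c
    return out


padded = pad_harmonics([1.0, 2.0], 4)                  # [1.0, 2.0, 0.0, 0.0]
doubled = interspace_harmonics([1.0, 2.0, 3.0], 2)     # [1.0, 0.0, 2.0, 0.0, 3.0]
```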

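Introducing spectral weighting consists of the in-
verse of the operation described by Eq. (28). If $C_m(k)$
and $D_m(k)$ are the coefficients of the Fourier series of
the excitation function representing the center of
pitch cycle $k$, then the coefficients of the spectrally
weighted Fourier series, $E_m(k)$ for cosine and $F_m(k)$ for
sine basis functions, are

$$E_m(k) = \frac{C_m(k)\sum_{n}\gamma^n a_n\cos\theta_{mn} + D_m(k)\sum_{n}\gamma^n a_n\sin\theta_{mn}}{\left(\sum_{n}\gamma^n a_n\cos\theta_{mn}\right)^2 + \left(\sum_{n}\gamma^n a_n\sin\theta_{mn}\right)^2},$$
$$F_m(k) = \frac{D_m(k)\sum_{n}\gamma^n a_n\cos\theta_{mn} - C_m(k)\sum_{n}\gamma^n a_n\sin\theta_{mn}}{\left(\sum_{n}\gamma^n a_n\cos\theta_{mn}\right)^2 + \left(\sum_{n}\gamma^n a_n\sin\theta_{mn}\right)^2}, \qquad \theta_{mn} = \frac{2\pi mnT}{p(k)}, \tag{35}$$

where $\gamma$ is the perceptual weighting factor, discussed earlier in Subsection 3.2.1.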

It is convenient to use a vector notation to describe the distortion criteria. The Fourier series coefficients of
$u(k, \tau)$ are represented by the vector

$$U = [\,C_0(k) + jD_0(k),\; C_1(k) + jD_1(k),\; \ldots,\; C_M(k) + jD_M(k)\,]^H, \tag{36}$$

where the superscript $H$ denotes the Hermitian transpose (conjugate transpose). The linear mapping of Eq. (35)
can now be described by the following diagonal matrix $W$:

$$W = \operatorname{diag}\left\{\left[\sum_{n}\gamma^n a_n\cos\left(\frac{2\pi mnT}{p(k)}\right) - j\sum_{n}\gamma^n a_n\sin\left(\frac{2\pi mnT}{p(k)}\right)\right]^{-1}\right\}_{m=0}^{M}. \tag{37}$$

The vector of Fourier series coefficients of the spec-
trally weighted signal is then $WU$. The distortion
measures can then be written as



$$D_1(u(k, \tau), v(l, \tau)) = (U - V)^H W^H W (U - V), \tag{38}$$

$$D_2(u(k, \tau), v(l, \tau)) = 1 - \frac{(\mathrm{Re}\{U^H W^H W V\})^2}{U^H W^H W U \; V^H W^H W V}, \tag{39}$$

where $V$ is the vector of Fourier series coefficients of
$v(l, \tau)$. If only certain frequency bands are to be con-
sidered, another diagonal weighting matrix can be
added to Eqs. (38) and (39). Using either distortion
measure (38) or (39), the alignment procedure of Eq.
(18) reduces to

$$\xi_K = \arg\max_{\xi}\, \mathrm{Re}\{U^H W^H W V\}
= \arg\max_{\xi} \sum_{m=0}^{M} \bigl\{ (E_m(0)\tilde{E}_m(K) + F_m(0)\tilde{F}_m(K))\cos(2\pi m\xi)
+ (F_m(0)\tilde{E}_m(K) - E_m(0)\tilde{F}_m(K))\sin(2\pi m\xi) \bigr\}. \tag{40}$$

The Fourier series coefficients for the properly time-
aligned prototype excitation waveform are now

$$C_m(K) = \tilde{C}_m(K)\cos(2\pi m\xi_K) - \tilde{D}_m(K)\sin(2\pi m\xi_K),$$
$$D_m(K) = \tilde{C}_m(K)\sin(2\pi m\xi_K) + \tilde{D}_m(K)\cos(2\pi m\xi_K). \tag{41}$$
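The alignment search and the subsequent rotation of the coefficients can be sketched as a grid search over the normalized shift. This is an illustration under stated assumptions: the spectral weighting is omitted (the weighting matrix is taken as the identity, so the same coefficients are used for the search and the rotation), the grid of 64 candidate shifts is arbitrary, and the function name is hypothetical.

```python
import math


def align(C0, D0, Ct, Dt, steps=64):
    """Find the shift xi maximizing the correlation, then rotate (Ct, Dt).

    C0, D0 : coefficients of the previous (aligned) prototype
    Ct, Dt : coefficients of the current, unaligned prototype
    Returns (xi, C, D) with C, D the aligned coefficients.
    """
    def corr(xi):
        total = 0.0
        for m in range(len(C0)):
            th = 2 * math.pi * m * xi
            total += (C0[m] * Ct[m] + D0[m] * Dt[m]) * math.cos(th)
            total += (D0[m] * Ct[m] - C0[m] * Dt[m]) * math.sin(th)
        return total

    xi = max((k / steps for k in range(steps)), key=corr)
    C, D = [], []
    for m in range(len(Ct)):
        th = 2 * math.pi * m * xi
        C.append(Ct[m] * math.cos(th) - Dt[m] * math.sin(th))
        D.append(Ct[m] * math.sin(th) + Dt[m] * math.cos(th))
    return xi, C, D


# A copy of the previous prototype, circularly shifted by a quarter period.
C0, D0 = [0.0, 1.0, 0.4, -0.2], [0.0, 0.5, -0.3, 0.1]
th = [2 * math.pi * m * 0.25 for m in range(4)]
Ct = [C0[m] * math.cos(th[m]) + D0[m] * math.sin(th[m]) for m in range(4)]
Dt = [-C0[m] * math.sin(th[m]) + D0[m] * math.cos(th[m]) for m in range(4)]
xi, C, D = align(C0, D0, Ct, Dt)
```

In this example the search recovers the quarter-period shift and the rotated coefficients coincide with those of the previous prototype.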

Once two (quantized) aligned prototype excitation
waveforms have been created, the interpolation pro-
cedure can proceed according to Eq. (19). An insight-
ful expression for the interpolation process is ob-
tained by first introducing the interpolation function

$\alpha(t)$:



*(0 = 



c(t) 



(42) 



Then, by writing the integration of Eq. (17) explicitly
in Eq. (20), we obtain, over the interpolation interval
$t_0 < t < t_K$,



*(D- 2 [(1 - *it))CJK) 4- *(r)C m (0)]co S J _ a(n)p(0) + a(t > )p(K) ) 

t ;^TT^(? 



»p(0) + *(f)pun. 



(43) 
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Note that interpolation of the pitch periods as de- 
scribed in this paper generally leads to a recon- 
structed speech signal that is not synchronous with 
the original speech. However, this has no effect on the 
perceived speech quality. If pitch-synchronous output 
speech is desired one may apply prototype waveform 
interpolation without interpolating the pitch periods 
but using the periods as defined by the pitch markers.
In this case, all pitch markers of a frame must be en-
coded and transmitted which increases the total bit
rate by about 760 b/s compared to pitch interpolation
[9]. (If the synchronicity is desired only at the end-
point of the interpolation interval, then the pitch pe-
riod $p(K)$ can be suitably adjusted to achieve this.
Additional information is required to prevent propa- 
gation of errors in this case.) 

Figure 8 compares the narrowband log-magnitude 
spectra of the original speech of a female speaker with 
that reconstructed with PWI at a bit rate of 1.7 kb/s, 
without inclusion of the short-term SCR correction. 
The spectra were computed by applying a discrete 
Fourier transform to a 40-ms speech segment. Both 
signals are band-limited to the frequency range 0.2-
3.8 kHz. The envelope and the harmonic structure of
both spectra coincide reasonably well. Since the
short-term SCR is not considered, the reconstructed
speech occasionally becomes more periodic in certain
frequency intervals than the original speech. This can
be seen in Fig. 8 for frequencies above 3 kHz. In addi-
tion, the pitch interpolation may sometimes cause rel-
atively large changes from the initial pitch periods as
defined by the pitch markers. This may result in a
slight frequency shift of the harmonics of the syn-
thetic speech compared to the original speech. De-
spite the fact that the speech quality improves signifi-
cantly with increasing bit rate, a comparison of the
log-magnitude spectra of reconstructed speech at 1.7
kb/s and higher rates does not reveal significant dif-
ferences.

FIG. 8. Comparison of spectra of the original speech (solid line)
and the reconstructed speech using PWI at 1.7 kb/s (dashed line).

5. CONCLUSIONS



The interpolation of prototype waveforms leads to
excellent voiced speech quality at bit rates between
2.5 and 4 kb/s. Unlike low-bit-rate CELP, the speech
is not distorted by background noise. We illustrated
the method with several linear-prediction-based tech-
niques. The resulting reconstructed speech signal is,
in general, not pitch synchronous with the original
signal but displays a similar waveform. The prototype
concept facilitates a relatively straightforward speech
synthesis system that generates natural sounding
voiced speech with a purely harmonic description of
the excitation signal, and linear interpolation over 20-
to 30-ms intervals. The interpolation procedure facili-
tates generation of a high level of periodicity, even at
low bit rates. Simple procedures can be applied to
control the degree of periodicity. Although the PWI
method, similarly to single-pulse coding, is currently
aimed at voiced speech only, it combines easily with
other linear-prediction-based coders such as CELP,
which can be used for unvoiced speech.

When the present method is compared to sinusoi- 
dal speech coding algorithms, it should be noted that 
the present method does not require encoding of fre-
quency onsets, birth and death of harmonics (except
for pitch doubling or halving) , and polynomial inter- 
polation. The phases of all harmonics of the excita- 
tion signal are implicitly encoded by a quantized vec- 
tor describing the prototype excitation waveform. For 
voiced speech, the low update rate of the pitch parame- 
ter and the inclination toward periodicity are the 
main advantages when compared to current CELP 
coders. 

Although this was not the primary goal of the PWI 
analysis-synthesis system, it can be used for time and 
pitch-period scaling. By simply changing the distance 
over which the prototypes are interpolated, a time-
scaled signal is obtained. This was found to work ex-
tremely well over a large range of scaling values. Pitch
scaling is similarly straightforward: by scaling the
pitch in the synthesizer by the desired amount, natu-
ral sounding speech with a modified pitch is produced.
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Despite the appearance of an integral, Eqs. (42) and 
(43) can be used directly to implement the algorithm.
This is because a slow wandering of the phase of the
speech signal waveform is of no practical significance.
Thus, numerical integration of (17) is a convenient
method of obtaining the argument of the sine and co-
sine terms. Alternatively, $\epsilon(t)$ can be computed with
the analytic expression of Eq. (17) and $2\pi\epsilon(t)$ can be
used as argument for the sine and cosine functions of
Eq. (43).
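A sample-by-sample version of the continuous interpolation, with the phase obtained by numerical integration, might look as follows. This is a sketch under assumptions that are not taken from the equations above: the interpolation function is taken to rise linearly from 0 to 1 over the interval, the pitch period is expressed in samples and interpolated with the same function, both prototypes have equally long coefficient lists (after zero-padding), and the phase is accumulated with a simple rectangle rule.

```python
import math


def interpolate_excitation(C0, D0, CK, DK, p0, pK, num_samples):
    """Synthesize the excitation between two aligned prototypes.

    (C0, D0), (CK, DK) : Fourier coefficients at the start and end prototypes
    p0, pK             : pitch periods (in samples) at the start and end
    """
    out, phase = [], 0.0
    for n in range(num_samples):
        alpha = n / num_samples                   # assumed linear interpolation function
        p = (1.0 - alpha) * p0 + alpha * pK       # instantaneous pitch period
        sample = 0.0
        for m in range(len(C0)):
            c = (1.0 - alpha) * C0[m] + alpha * CK[m]
            d = (1.0 - alpha) * D0[m] + alpha * DK[m]
            sample += (c * math.cos(2 * math.pi * m * phase)
                       + d * math.sin(2 * math.pi * m * phase))
        out.append(sample)
        phase += 1.0 / p                          # rectangle-rule integration of 1/p(t)
    return out


# With identical prototypes and a constant pitch, the output is simply periodic.
C, D = [0.0, 1.0, 0.5], [0.0, 0.2, -0.1]
steady = interpolate_excitation(C, D, C, D, 20.0, 20.0, 60)
```

Small errors in the accumulated phase only shift the waveform slowly in time, which is the phase wandering that the text describes as having no practical significance.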

3.3. Quantization of the Prototype Waveforms 

Since quantization of the coefficients can be per-
formed with standard procedures, we focus here on
the quantization of the prototype excitation wave-
form. In the simplest case, each prototype excitation 
waveform can be represented as a single impulse, re- 
ducing the PWI coding procedure to an LP-based vo- 
coder. 

The following method of quantization can be used
for all descriptions of the prototype excitation wave-
form discussed before. The prototype excitation wave-
forms are encoded differentially. The objective is to
approximate the current, aligned prototype excita-
tion waveform, $u(K, \tau)$, with a contribution from the
previous, quantized excitation waveform, $\hat{u}(0, \tau)$, and
contributions from one or more codebooks. (The
alignment procedure is advantageously performed
with respect to the previous, aligned, quantized proto-
type excitation waveform.) The quantized excitation
at the endpoint of the interpolation interval, $t_K$, is

$$\hat{u}(K, \tau) = \lambda_0 \hat{u}(0, \tau) + \sum_{i=1}^{I} \lambda_i c^{(i)}_{q_i}(\tau), \tag{44}$$

where $c^{(i)}_{q_i}(\tau)$ is the waveform entry with index $q_i$
from codebook $i$, and the $\lambda_i$ are scaling factors. The
waveform can be quantized by minimizing the dis-
tance measure of Eq. (33) for the set of values of
$\{\lambda_0, \ldots, \lambda_I, q_1, \ldots, q_I\}$:

$$\{\lambda_0, \ldots, \lambda_I, q_1, \ldots, q_I\} = \arg\min D_1\Bigl(u(K, \tau),\; \lambda_0 \hat{u}(0, \tau) + \sum_{i=1}^{I} \lambda_i c^{(i)}_{q_i}(\tau)\Bigr). \tag{45}$$

To allow a sequential optimization of these parame- 
ters, without reducing the performance, it is useful to 
orthogonalize the codebook entries of each quantiza-
tion codebook to the winning entries of the previous
optimization stages. 
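The sequential search with orthogonalization can be illustrated for the simplest configuration: one gain for the previous prototype followed by one codebook. The sketch assumes an unweighted Euclidean distance in place of the perceptually weighted measure, and the function names and test vectors are hypothetical.

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def quantize_prototype(target, previous, codebook):
    """Two-stage search: gain of the previous prototype, then one codebook entry.

    Returns (index, quantized_waveform, squared_error).
    """
    pp = dot(previous, previous)
    lam0 = dot(target, previous) / pp
    resid = [t - lam0 * q for t, q in zip(target, previous)]
    best = None
    for idx, entry in enumerate(codebook):
        # Orthogonalize the entry to the winner of the previous stage.
        proj = dot(entry, previous) / pp
        orth = [e - proj * q for e, q in zip(entry, previous)]
        energy = dot(orth, orth)
        if energy == 0.0:
            continue
        gain = dot(resid, orth) / energy
        err = dot(resid, resid) - gain * gain * energy
        if best is None or err < best[0]:
            best = (err, idx, gain, orth)
    err, idx, gain, orth = best
    quantized = [lam0 * q + gain * o for q, o in zip(previous, orth)]
    return idx, quantized, err


# Single-pulse codebook; the target is the scaled previous prototype plus a pulse.
previous = [1.0, 2.0, -1.0, 0.5, 0.0, 3.0]
codebook = [[1.0 if i == j else 0.0 for i in range(6)] for j in range(6)]
target = [0.5 * q + 2.0 * c for q, c in zip(previous, codebook[3])]
idx, quantized, err = quantize_prototype(target, previous, codebook)
```

Because each entry is orthogonalized to the previous-stage winner, the first-stage gain does not need to be re-optimized after the codebook entry is chosen, which is the reason the sequential search loses no performance in this example.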

Note that the differential encoding of the prototype
excitation waveform is similar in concept to the quan-
tization method of CELP where one first applies a
closed-loop pitch prediction [3], or adaptive code-
book [4], and then a fixed (stochastic) codebook.

We use this differential coding scheme with two 
codebooks. The first codebook consists of only single 
pulses in the time domain, aimed at accurately model-
ing the pitch pulse (in the sampled-time-domain no-
tation this corresponds to a single nonzero sample; in
the Fourier series notation a band-limited pulse is 
used) . Similarly to results for the fixed codebook of 
CELP [4] , we have found that a good choice for the 
second codebook is one with entries having a sparse 
set of pulses in the time domain. 

We have obtained excellent speech quality using 
three times 5 bits for the three gain factors, two times
8 bits for the codebooks, and 7 bits for the pitch. At an 
update rate of 50 Hz, this corresponds to a bit rate of 
1.9 kb/s for the excitation signal. It is likely that this 
bit rate can be reduced when further refinements are 
introduced. In particular, it is likely that the PWI- 
coder efficiency can be improved by training the code- 
books. 

The LP coefficients can be encoded as in other LP- 
based systems. We have obtained good results with 
the efficient vector quantization method described 
in [17]. 

4. RESULTS 

An example of waveforms in PWI speech coding is 
illustrated in Fig. 7. Figure 7a shows the original 
speech waveform for a voiced interval and Fig. 7b the 
pitch markers. The prototype excitation waveforms, 
which are delimited by the dotted lines, were ex-
tracted using pitch markers as described in Subsec-
tion 3.1.1. In the example shown, we applied block-
wise interpolation with fixed time scale.

The lowest possible bit rate in PWI coding is ob-
tained if each prototype is represented by a single im-
pulse. For this case, PWI reduces to LP vocoding. Fig-
ures 7c and 7d show the resulting excitation and re-
constructed speech waveforms, respectively. Each
prototype waveform is represented by its pitch period
$p(K)$, the impulse amplitude, and the set of LP coeffi-
cients. By allocating 7 bits for the pitch period, 8 bits
for the impulse amplitude, and 24 bits for the vector
quantization of the filter parameters, the overall bit
rate amounts to 1.7 kb/s for an update interval of 25
ms (3 bits are used for the voiced/unvoiced classifica-
tion with a resolution of 50 samples). At this bit rate
PWI achieves only a vocoder-like speech quality and
suffers from some buzziness.
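The bit-rate figures quoted in the text follow directly from the bit allocations and update intervals. The short Python check below reproduces the arithmetic; the helper name `bit_rate` is hypothetical, and the 20-ms interval is simply the 50-Hz update rate of Subsection 3.3.

```python
def bit_rate(bits_per_update, update_interval_ms):
    """Bit rate in b/s for a given bit budget per update interval."""
    return bits_per_update * 1000.0 / update_interval_ms


# Single-impulse (vocoder-like) mode: pitch, impulse amplitude, LP parameters,
# and voiced/unvoiced classification, updated every 25 ms.
vocoder_bits = 7 + 8 + 24 + 3
vocoder_rate = bit_rate(vocoder_bits, 25)        # 42 bits per 25 ms

# Excitation budget of Subsection 3.3: three 5-bit gains, two 8-bit codebook
# indices, and a 7-bit pitch, updated at 50 Hz (every 20 ms).
excitation_bits = 3 * 5 + 2 * 8 + 7
excitation_rate = bit_rate(excitation_bits, 20)  # 38 bits per 20 ms
```

The first budget gives 1680 b/s, which the text rounds to 1.7 kb/s, and the second gives the 1.9 kb/s quoted for the excitation signal.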

To obtain high-quality speech, the prototype exci-
tation waveforms must be quantized more accurately.
Figures 7e and 7f show the excitation and recon-
structed speech waveforms, at an overall bit rate of
2.6 kb/s. Differential encoding of the prototype exci-
tation waveforms is performed as described in Sub-
section 3.3. The long-term SCR is maintained identi-
cal to that of the original signal by controlling the
gains (no subbands were used). The bit allocation is 7
bits for the pitch period, 8 bits for the single pulse
location, 8 bits for a fixed codebook component, and 5
bits each for the gain of the previous prototype, the
single-pulse amplitude, and the gain of the fixed code-
book component. The buzziness present in the recon-
structed speech of the 1.7 kb/s example is almost com-
pletely removed. It is removed completely by main-
taining the short-term SCR similar to that of the
original signal. As mentioned before, for the majority
of speakers this can be accomplished by injecting a
fixed amount of noise for frequencies beyond 2 kHz.

FIG. 7. Waveforms in PWI speech coding with prototype excitation waveform extraction based on pitch markers and with fixed time-scale
blockwise interpolation: (a) original speech, (b) pitch markers, (c) interpolated excitation, and (d) reconstructed speech at an overall bit
rate of 1.7 kb/s; (e) interpolated excitation and (f) reconstructed speech at an overall bit rate of 2.6 kb/s; (g) interpolated excitation and
(h) reconstructed speech obtained with unquantized prototype waveforms.



For comparison purposes, Figs. 7g and 7h show the
excitation and the reconstructed speech waveforms
obtained for the unquantized prototype waveforms.
In this case, each prototype excitation waveform is
identical to a pitch cycle of the original residual
speech signal. The mismatch between the original
and reconstructed speech is only due to the interpola-
tion of the intermediate periods between two proto-
type waveforms.

We have performed formal listening tests in which
the voiced sections of the speech reconstructed by sev-
eral coders were replaced with speech reconstructed by
PWI. These tests indicate that the continuous inter-
polation method, when operating at a rate of 2.6 kb/s,
results in a voiced speech quality similar to that of
CCITT-standard 32 kb/s ADPCM. When overlap-
add procedures are not used, the blockwise interpola-
tion methods perform marginally worse.
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ABSTRACT 




A speech coding system providing reconstructed voiced
speech with a smoothly evolving pitch-cycle waveform. A
speech signal is represented by isolating and coding proto-
type waveforms, each prototype waveform is an exemplary
pitch-cycle of voiced speech. A coded prototype waveform
is transmitted at regular intervals to a receiver which syn-
thesizes (or reconstructs) an estimate of the original speech
segment based on the prototypes. The estimate of the
original speech signal is provided by a prototype interpola-
tion process which provides a smooth time-evolution of
pitch-cycle waveforms in the reconstructed speech.
Illustratively, a frame of original speech is coded by first
filtering the frame with a linear predictive filter. Next a
pitch-cycle of the filtered original is identified and extracted
as a prototype waveform. The prototype waveform is then
represented as a set of Fourier series (frequency domain)
coefficients. The pitch-period and Fourier coefficients of the
prototype, as well as the parameters of the linear predictive
filter, are used to represent a frame of original speech. These
parameters are coded by vector and scalar quantization and
communicated over a channel to a receiver which uses
information representing two consecutive frames to recon-
struct the earlier of the two frames based on a continuous
prototype waveform interpolation process. Waveform inter-
polation may be combined with conventional CELP tech-
niques for coding unvoiced portions of the original speech
signal.
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